[MetaxGPU][feature] Add MACA GEMM compiler-path layout support#90
[MetaxGPU][feature] Add MACA GEMM compiler-path layout support#90VitalyAnkh wants to merge 4 commits into
Conversation
|
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀 |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds MACA GEMM support end-to-end: new A/B layout factory and fragment refactor, TVM/TIR builtin intrinsics and Python/FFI bindings, device-side barrier and cp.async helpers, GEMM template updates including a WSM kernel, codegen emission and cp-async sizing, lowering helpers for permuted layouts, k-packing in MMA lowering, and tests. ChangesMACA GEMM Architecture and Intrinsics
Estimated code review effort 🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tilelang/tileop/gemm/gemm_maca_mma.py (2)
129-130: 💤 Low valueReplace lambda assignments with
defstatements per linter guidance.Static analysis (E731) flagged these lambda assignments. Using
defis more idiomatic Python.Proposed fix
if use_template and self.is_gemm_ss(): - shared_layout_a = lambda buf: make_maca_gemm_ab_layout(buf, 1 if self.trans_A else 2) - shared_layout_b = lambda buf: make_maca_gemm_ab_layout(buf, 2 if self.trans_B else 1) + def shared_layout_a(buf): + return make_maca_gemm_ab_layout(buf, 1 if self.trans_A else 2) + + def shared_layout_b(buf): + return make_maca_gemm_ab_layout(buf, 2 if self.trans_B else 1)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tilelang/tileop/gemm/gemm_maca_mma.py` around lines 129 - 130, The two lambda assignments shared_layout_a and shared_layout_b should be replaced with proper function definitions to satisfy the linter (E731); define functions (e.g. def shared_layout_a(buf): return make_maca_gemm_ab_layout(buf, 1 if self.trans_A else 2) and def shared_layout_b(buf): return make_maca_gemm_ab_layout(buf, 2 if self.trans_B else 1)) so callers remain unchanged and references to make_maca_gemm_ab_layout, self.trans_A and self.trans_B are preserved.
28-35: 💤 Low valueConsider wrapping the int conversion for a clearer error message.
If
TILELANG_MACA_GEMM_K_PACKcontains non-numeric content,int(value)raises a genericValueError. Wrapping would provide a clearer diagnostic.Proposed fix
def _get_maca_gemm_k_pack(default: int = 1) -> int: value = os.environ.get("TILELANG_MACA_GEMM_K_PACK") if value is None: return default - k_pack = int(value) + try: + k_pack = int(value) + except ValueError: + raise ValueError( + f"TILELANG_MACA_GEMM_K_PACK must be an integer, got {value!r}" + ) from None if k_pack < 1: raise ValueError(f"TILELANG_MACA_GEMM_K_PACK must be >= 1, got {k_pack}") return k_pack🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tilelang/tileop/gemm/gemm_maca_mma.py` around lines 28 - 35, The integer conversion in _get_maca_gemm_k_pack currently calls int(value) directly so non-numeric env values raise a generic ValueError; wrap the conversion in a try/except around int(value) (or name it raw and then parse) and on exception raise a clearer ValueError that includes the env var name TILELANG_MACA_GEMM_K_PACK and the offending value (optionally include the original exception message) before the existing k_pack < 1 check.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/tl_templates/maca/barrier.h`:
- Around line 42-47: The inline asm in the synchronization primitives
(mbarrier_init, mbarrier_try_wait, mbarrier_wait, mbarrier_arrive (both
overloads), mbarrier_expect_tx, mbarrier_arrive_expect_tx,
mbarrier_cp_async_arrive, mbarrier_cp_async_arrive_noinc, fence_proxy_async,
fence_barrier_init) must declare the "memory" clobber so the compiler cannot
reorder loads/stores across these asm volatile blocks; update each asm
volatile(...) invocation to include "memory" in the clobber list (preserving
existing input/output constraints such as the "r" operands and any existing
clobbers) so the asm acts as a compiler-level memory barrier.
---
Nitpick comments:
In `@tilelang/tileop/gemm/gemm_maca_mma.py`:
- Around line 129-130: The two lambda assignments shared_layout_a and
shared_layout_b should be replaced with proper function definitions to satisfy
the linter (E731); define functions (e.g. def shared_layout_a(buf): return
make_maca_gemm_ab_layout(buf, 1 if self.trans_A else 2) and def
shared_layout_b(buf): return make_maca_gemm_ab_layout(buf, 2 if self.trans_B
else 1)) so callers remain unchanged and references to make_maca_gemm_ab_layout,
self.trans_A and self.trans_B are preserved.
- Around line 28-35: The integer conversion in _get_maca_gemm_k_pack currently
calls int(value) directly so non-numeric env values raise a generic ValueError;
wrap the conversion in a try/except around int(value) (or name it raw and then
parse) and on exception raise a clearer ValueError that includes the env var
name TILELANG_MACA_GEMM_K_PACK and the offending value (optionally include the
original exception message) before the existing k_pack < 1 check.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a73a170a-a045-4edd-b421-4eab667c7d4e
📒 Files selected for processing (19)
src/layout/gemm_layouts.ccsrc/layout/layout.ccsrc/layout/layout.hsrc/op/builtin.ccsrc/op/builtin.hsrc/target/codegen_maca.ccsrc/tl_templates/maca/barrier.hsrc/tl_templates/maca/common.hsrc/tl_templates/maca/copy.hsrc/tl_templates/maca/gemm.htesting/maca/language/test_tilelang_language_access_ptr_codegen.pytesting/maca/tilelibrary/test_tilelang_maca_gemm_template_contract.pytilelang/contrib/mxcc.pytilelang/language/tir/ir.pytilelang/language/tir/ir.pyitilelang/language/tir/op.pytilelang/layout/__init__.pytilelang/layout/swizzle.pytilelang/tileop/gemm/gemm_maca_mma.py
f1a3a0d to
c23ba70
Compare
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tilelang/maca/op/gemm/gemm_mma.py (1)
306-318:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winCritical:
T.clear(C_buf)inside the loop will discard accumulated results.In
_gemm_srr, theT.clear(C_buf)is placed inside thefor kiloop. This means the accumulator will be cleared on every iteration, discarding all previously computed partial products. This differs from_gemm_ssrand_gemm_rsrwhere the clear is correctly placed before the loop.Proposed fix
`@T.prim_func` def _gemm_srr() -> None: ... A_local = T.alloc_local((warp_rows * local_size_a * k_pack), in_dtype) + if clear_accum: + T.clear(C_buf) for ki in T.serial(0, (block_K // macro_size_k)): - if clear_accum: - T.clear(C_buf) # Load A into fragment mma_emitter.ldmatrix_a(🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tilelang/maca/op/gemm/gemm_mma.py` around lines 306 - 318, The accumulator clear call is inside the ki loop in _gemm_srr which resets C_buf every iteration and discards partial products; move the conditional clear_accum / T.clear(C_buf) out of the for ki in T.serial(0, (block_K // macro_size_k)) loop so C_buf is cleared once before the loop begins (preserve the existing clear_accum boolean check), leaving the subsequent mma_emitter.ldmatrix_a(...) and mma_emitter.mma(...) calls unchanged.
🧹 Nitpick comments (5)
tilelang/intrinsics/tcgen05_macro_generator.py (1)
278-304: 💤 Low valueConsider extracting duplicated
access_ptr_fromhelper.The
access_ptr_fromfunction is duplicated verbatim in bothtcgen05mma_ssandtcgen05mma_ts. Extract it as a class method or module-level function to reduce duplication.Also applies to: 473-499
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tilelang/intrinsics/tcgen05_macro_generator.py` around lines 278 - 304, The duplicated helper access_ptr_from (used in tcgen05mma_ss and tcgen05mma_ts) should be extracted to a single module-level function (or a shared class method) so both generators call the same implementation; create a top-level def access_ptr_from(buffer_or_load_or_region, access_type="r") that preserves the existing logic (Buffer, BufferLoad handling with offset/stride computation, BufferRegion handling, and the same error raising), replace the duplicated blocks in tcgen05mma_ss and tcgen05mma_ts with calls to this new function, and ensure any local names (Buffer, BufferLoad, BufferRegion, tvm) remain in scope or are imported so behavior is unchanged.tilelang/intrinsics/mma_sp_layout.py (1)
147-154: 💤 Low valueType hint for
dtypeis inconsistent with implementation.The type hint
dtype: Literal["float16", "int8"]doesn't match the actual implementation which handlesdtype_bits == 32(line 157). Consider updating the type hint to include "float32" or use a more general type.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tilelang/intrinsics/mma_sp_layout.py` around lines 147 - 154, The dtype type hint on get_ldmatrix_offset_b is incorrect for the implemented logic: update the annotation for the dtype parameter in get_ldmatrix_offset_b to include "float32" (e.g., Literal["float16","int8","float32"]) or replace the Literal with a broader type (like str or an enum) so it matches the branch that checks for dtype_bits == 32 and the float32 handling in the function body.tilelang/intrinsics/wgmma_macro_generator.py (2)
500-502: 💤 Low valueSimplify redundant boolean logic.
The current logic with
[False]prepended andany()is equivalent to just checkingnot transposed. The[False]has no effect on theany()result.Proposed simplification
- is_sr_conditions = [False] - is_sr_conditions.append(not transposed) - is_sr_axis_order = any(is_sr_conditions) + is_sr_axis_order = not transposed🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tilelang/intrinsics/wgmma_macro_generator.py` around lines 500 - 502, Replace the redundant construction of is_sr_conditions and the any() call with a direct boolean based on transposed: the current block that creates is_sr_conditions = [False]; is_sr_conditions.append(not transposed); is_sr_axis_order = any(is_sr_conditions) should be simplified to set is_sr_axis_order directly from not transposed (i.e., is_sr_axis_order = not transposed) so remove the temporary list and any() usage; update any dependent logic/comments in the function wgmma_macro_generator.py to reflect the simplified variable.
165-176: 💤 Low valueTypo in method name:
_determinate_swizzle_modeshould be_determine_swizzle_mode.Proposed fix
- def _determinate_swizzle_mode(self, buffer: Buffer, layout: Layout) -> SwizzleMode: + def _determine_swizzle_mode(self, buffer: Buffer, layout: Layout) -> SwizzleMode:Also update the call sites at lines 202, 203, and 363.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tilelang/intrinsics/wgmma_macro_generator.py` around lines 165 - 176, Rename the misspelled method `_determinate_swizzle_mode` to `_determine_swizzle_mode` and update every call site that references `_determinate_swizzle_mode` to use `_determine_swizzle_mode` instead; specifically change the function definition and all invocations (e.g., the calls currently referencing `_determinate_swizzle_mode` around where swizzle mode is resolved) to the corrected name so references (and any imports/uses within the same module) remain consistent and tests/imports continue to work.tilelang/intrinsics/wmma_macro_generator.py (1)
196-212: 💤 Low valueUnused macro parameter
A_shared_buf.The inner macro
_warp_ldmatrix_adeclaresA_shared_bufas a parameter but never uses it. The actual buffer access usesA_buffrom the outer scope. Consider removing the unused parameter or using it consistently.Proposed fix
`@T.macro` - def _warp_ldmatrix_a(A_local_buf, A_shared_buf, ki, thread_binding, rk=0): + def _warp_ldmatrix_a(A_local_buf, ki, thread_binding, rk=0): tx, _, warp_m = self.extract_thread_binding(thread_binding) ... - return _warp_ldmatrix_a(A_local_buf, A_shared_buf, ki, thread_binding, rk) + return _warp_ldmatrix_a(A_local_buf, ki, thread_binding, rk)Same applies to
_warp_ldmatrix_bat lines 236-251.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tilelang/intrinsics/wmma_macro_generator.py` around lines 196 - 212, The macro parameter A_shared_buf in _warp_ldmatrix_a is unused (the body reads A_buf from outer scope); either remove A_shared_buf from the macro signature and from its invocation (return line) or change the macro body to consistently use A_shared_buf instead of A_buf; apply the same fix for the symmetric macro _warp_ldmatrix_b (remove unused B_shared_buf or switch uses to B_shared_buf) and update any calls to _warp_ldmatrix_a/_warp_ldmatrix_b accordingly so signatures match their invocations.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tilelang/intrinsics/maca_mma_macro_generator.py`:
- Around line 500-502: The ValueError in the else branch incorrectly states
"k_dim must be 0" while valid k_dim values handled are 4, 16, and 32; update the
raise in tilelang/intrinsics/maca_mma_macro_generator.py (the branch that
currently raises ValueError(f"k_dim must be 0 currently but got {k_dim}")) to a
clearer message that lists the allowed values and shows the actual value, e.g.
raise ValueError(f"k_dim must be one of {{4, 16, 32}} but got {k_dim}").
- Line 782: Remove the debugging print statement that emits kernel compilation
noise: delete the call to print(self.a_preshuffle) in the MacaMmaMacroGenerator
(or the method where self.a_preshuffle is used) so stdout is not polluted during
kernel compilation; ensure no other side-effects rely on that print and run
tests to confirm behavior unchanged.
In `@tilelang/intrinsics/mfma_layout.py`:
- Around line 113-128: The function shared_16x64_to_local_64x16_layout_B is a
copy of the _A variant and uses the wrong index mapping; replace its body so it
follows the established _B pattern: compute thread_id as j + (i // 16) * 16 and
local as i % 16, then return thread_id, local (update
shared_16x64_to_local_64x16_layout_B accordingly to match the intended mapping
and be consistent with shared_16x64_to_local_64x16_layout_A and other *_B
variants).
In `@tilelang/intrinsics/mfma_macro_generator.py`:
- Line 835: Remove the leftover debug print that emits self.a_preshuffle in
mfma_macro_generator.py: locate the print(self.a_preshuffle) call (inside the
MFMA macro generation code where self.a_preshuffle is referenced) and delete it;
if silent diagnostics are needed instead, replace it with a debug-level logging
call (e.g., using the module/class logger) rather than printing to stdout.
In `@tilelang/intrinsics/tcgen05_macro_generator.py`:
- Around line 22-27: The SwizzleMode IntEnum currently has incorrect numeric
mappings; update the SwizzleMode enum in tcgen05_macro_generator.py so the
members match the TCGen05 PTX spec: set NONE = 0, SWIZZLE_32B = 1, SWIZZLE_64B =
2, and SWIZZLE_128B = 3 (these correspond to bits 61–63 of the shared-memory
descriptor). Modify the class SwizzleMode to use these exact integer values so
code referencing SwizzleMode uses the correct PTX swizzle-mode encodings.
In `@tilelang/intrinsics/utils.py`:
- Around line 96-116: The function get_mma_micro_size currently types its dtype
as Literal["float16", "int8"] but the implementation and docstring also accept
"float8_e4m3" and "float8_e5m2"; update the type hint on get_mma_micro_size to
include those FP8 literals (e.g.,
Literal["float16","int8","float8_e4m3","float8_e5m2"]) or broaden to
str/Union[...] so the annotation matches the handled values, and ensure the
docstring stays consistent with the new annotation.
---
Outside diff comments:
In `@tilelang/maca/op/gemm/gemm_mma.py`:
- Around line 306-318: The accumulator clear call is inside the ki loop in
_gemm_srr which resets C_buf every iteration and discards partial products; move
the conditional clear_accum / T.clear(C_buf) out of the for ki in T.serial(0,
(block_K // macro_size_k)) loop so C_buf is cleared once before the loop begins
(preserve the existing clear_accum boolean check), leaving the subsequent
mma_emitter.ldmatrix_a(...) and mma_emitter.mma(...) calls unchanged.
---
Nitpick comments:
In `@tilelang/intrinsics/mma_sp_layout.py`:
- Around line 147-154: The dtype type hint on get_ldmatrix_offset_b is incorrect
for the implemented logic: update the annotation for the dtype parameter in
get_ldmatrix_offset_b to include "float32" (e.g.,
Literal["float16","int8","float32"]) or replace the Literal with a broader type
(like str or an enum) so it matches the branch that checks for dtype_bits == 32
and the float32 handling in the function body.
In `@tilelang/intrinsics/tcgen05_macro_generator.py`:
- Around line 278-304: The duplicated helper access_ptr_from (used in
tcgen05mma_ss and tcgen05mma_ts) should be extracted to a single module-level
function (or a shared class method) so both generators call the same
implementation; create a top-level def access_ptr_from(buffer_or_load_or_region,
access_type="r") that preserves the existing logic (Buffer, BufferLoad handling
with offset/stride computation, BufferRegion handling, and the same error
raising), replace the duplicated blocks in tcgen05mma_ss and tcgen05mma_ts with
calls to this new function, and ensure any local names (Buffer, BufferLoad,
BufferRegion, tvm) remain in scope or are imported so behavior is unchanged.
In `@tilelang/intrinsics/wgmma_macro_generator.py`:
- Around line 500-502: Replace the redundant construction of is_sr_conditions
and the any() call with a direct boolean based on transposed: the current block
that creates is_sr_conditions = [False]; is_sr_conditions.append(not
transposed); is_sr_axis_order = any(is_sr_conditions) should be simplified to
set is_sr_axis_order directly from not transposed (i.e., is_sr_axis_order = not
transposed) so remove the temporary list and any() usage; update any dependent
logic/comments in the function wgmma_macro_generator.py to reflect the
simplified variable.
- Around line 165-176: Rename the misspelled method `_determinate_swizzle_mode`
to `_determine_swizzle_mode` and update every call site that references
`_determinate_swizzle_mode` to use `_determine_swizzle_mode` instead;
specifically change the function definition and all invocations (e.g., the calls
currently referencing `_determinate_swizzle_mode` around where swizzle mode is
resolved) to the corrected name so references (and any imports/uses within the
same module) remain consistent and tests/imports continue to work.
In `@tilelang/intrinsics/wmma_macro_generator.py`:
- Around line 196-212: The macro parameter A_shared_buf in _warp_ldmatrix_a is
unused (the body reads A_buf from outer scope); either remove A_shared_buf from
the macro signature and from its invocation (return line) or change the macro
body to consistently use A_shared_buf instead of A_buf; apply the same fix for
the symmetric macro _warp_ldmatrix_b (remove unused B_shared_buf or switch uses
to B_shared_buf) and update any calls to _warp_ldmatrix_a/_warp_ldmatrix_b
accordingly so signatures match their invocations.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 05f94a50-4354-446b-99ad-ef5a91391483
📒 Files selected for processing (22)
src/target/codegen_maca.ccsrc/tl_templates/maca/barrier.hsrc/tl_templates/maca/copy.hsrc/tl_templates/maca/gemm.htesting/maca/tilelibrary/test_tilelang_maca_gemm_template_contract.pytilelang/intrinsics/__init__.pytilelang/intrinsics/maca_mma_macro_generator.pytilelang/intrinsics/mfma_layout.pytilelang/intrinsics/mfma_macro_generator.pytilelang/intrinsics/mma_layout.pytilelang/intrinsics/mma_macro_generator.pytilelang/intrinsics/mma_sm70_layout.pytilelang/intrinsics/mma_sm70_macro_generator.pytilelang/intrinsics/mma_sp_layout.pytilelang/intrinsics/mma_sp_macro_generator.pytilelang/intrinsics/tcgen05_macro_generator.pytilelang/intrinsics/utils.pytilelang/intrinsics/wgmma_macro_generator.pytilelang/intrinsics/wmma_layout.pytilelang/intrinsics/wmma_macro_generator.pytilelang/layout/swizzle.pytilelang/maca/op/gemm/gemm_mma.py
e8a7191 to
17a2e8f
Compare
There was a problem hiding this comment.
Actionable comments posted: 8
🧹 Nitpick comments (1)
tilelang/contrib/mxcc.py (1)
20-44: ⚡ Quick winDocument the
TILELANG_MXCC_FLAGSenvironment variable.The function docstring doesn't mention that additional compiler flags can be injected via the
TILELANG_MXCC_FLAGSenvironment variable. Users won't discover this feature without reading the implementation.📝 Proposed docstring update
def compile_maca(code, target_format="mcbin", arch=None, options=None, path_target=None, verbose=False): """Compile maca code with MXCC from env. Parameters ---------- code : str The maca code. target_format : str The target format of mxcc compiler. arch : str The maca architecture. options : str or list of str The additional options. path_target : str, optional Output file. + Environment Variables + --------------------- + TILELANG_MXCC_FLAGS : str, optional + Additional MXCC command-line flags, parsed with shell-like syntax. + These flags are appended after explicit options but before output specs. + Return ------ cubin : bytearray The bytearray of the fatbin """🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tilelang/contrib/mxcc.py` around lines 20 - 44, The docstring for compile_maca is missing documentation for the TILELANG_MXCC_FLAGS environment variable; update the compile_maca function docstring to mention that additional MXCC compiler flags can be supplied via the TILELANG_MXCC_FLAGS environment variable (e.g., a space-separated string or list-equivalent), describe its effect on the options passed to mxcc, and include an example or note about precedence/format so users can discover and use this feature.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/target/codegen_maca.cc`:
- Around line 1666-1685: The code emits PrintExpr(op->args[2]) as a template
parameter for tl::cp_async_gs, which can produce invalid C++ if the size isn't a
compile-time IntImm; update the ptx_cp_async handling (the block guarded by
op->op.same_as(builtin::ptx_cp_async())) to first normalize/validate op->args[2]
as an IntImm using the same helper used elsewhere (call
GetTileLangCPAsyncTransferBytes on op->args[2] to obtain a validated integer
byte count), then use that integer value (as a literal string) when emitting
both tl::cp_async_gs<...>(...) and tl::cp_async_gs_conditional<...>(...), and
emit a clear error/check if GetTileLangCPAsyncTransferBytes fails; apply the
same pattern consistently wherever ptx_cp_async is handled (e.g., the other
backends).
In `@src/tl_templates/maca/barrier.h`:
- Around line 68-75: The inline assembly in the __forceinline__ barrier helper
defines non-unique labels LAB_WAIT and DONE which will clash when the function
is instantiated in multiple translation units; update the labels inside the asm
block (the ones referenced by the branch instructions in the mbarrier loop) to
use unique local labels such as LAB_WAIT_%= and DONE_%= (or switch to numeric
local labels like 1: with 1b/1f references) so each inlined instance gets
distinct label names and avoids duplicate-label assembler errors.
In `@src/tl_templates/maca/gemm_wsm.h`:
- Around line 76-83: The code always seeds C_f32 from accum, ignoring the
template flag clear_accum; change the logic in the gemm_ss_wsm routine so that
before the current loop that assigns C_f32 from accum you check clear_accum and
if true initialize C_f32 to zero (or skip loading accum entirely), otherwise
proceed to load accum into C_f32 as currently done; reference the C_f32 array,
accum pointer, and the clear_accum template parameter to implement the
conditional initialization so first-use outputs are not contaminated by stale
fragments.
In `@src/tl_templates/maca/gemm.h`:
- Around line 96-100: The remove_swizzle helper is checking sizeof(A_type) but
is used for the B fragment (e.g., called with tCrB.layout()), which strips the
wrong layout when A and B have different widths; change the constexpr condition
to inspect sizeof(B_type) instead of sizeof(A_type) so that
ComposedLayout<Args...> const &layout returns layout.layout_b() only when B_type
is 2 bytes (and otherwise returns layout), updating the remove_swizzle
implementation accordingly to reference B_type for the swizzle decision so
gemm() receives the correct B layout.
In `@testing/maca/language/test_tilelang_language_access_ptr_codegen.py`:
- Around line 168-169: The two MACA tests (e.g., the test function
test_maca_bsm_intrinsics_codegen and the similar test at lines ~206-207) are
unconditionally running; gate them with the project’s MACA availability marker
so they are skipped when MACA isn’t present. Locate the test definitions
(test_maca_bsm_intrinsics_codegen and the other MACA-targeted test) and add the
same pytest marker used elsewhere in this file (e.g., the MACA availability
skip/marker) so the tests are automatically skipped when the project’s MACA
flag/marker indicates MACA is unavailable.
In `@tilelang/contrib/mxcc.py`:
- Around line 92-94: Handle malformed TILELANG_MXCC_FLAGS by catching
shlex.split ValueError around the call that appends to cmd: when reading
extra_env_flags and calling shlex.split(extra_env_flags) (the code that mutates
cmd using variables cmd and extra_env_flags), wrap the split in try/except
ValueError and raise a clear, user-facing error (or exit with a descriptive
message) that includes the original exception text and the offending
extra_env_flags value so users see "malformed TILELANG_MXCC_FLAGS: <details>"
instead of a raw traceback.
In `@tilelang/language/gemm_op.py`:
- Around line 159-160: The gemm function added a new public parameter
annotations but the docstring Args: section for gemm does not mention it; update
the gemm docstring to document the annotations parameter (type, purpose,
default/optional behavior) alongside the existing arguments—reference the gemm
function and the annotations parameter in your description so users know what
values (e.g., dict | None) are expected and how annotations affect behavior;
keep the style consistent with the other Args entries in the existing docstring.
In `@tilelang/tileop/gemm/gemm_base.py`:
- Around line 168-170: The annotations property can return None and cause
callers like self.annotations.get(...) to crash; change the property
(annotations) to always return a dict by retrieving getattr(self.gemm_node,
"annotations", None) and normalizing it to an empty dict when falsy (or
converting to dict if needed) so callers can safely call .get() on the result.
---
Nitpick comments:
In `@tilelang/contrib/mxcc.py`:
- Around line 20-44: The docstring for compile_maca is missing documentation for
the TILELANG_MXCC_FLAGS environment variable; update the compile_maca function
docstring to mention that additional MXCC compiler flags can be supplied via the
TILELANG_MXCC_FLAGS environment variable (e.g., a space-separated string or
list-equivalent), describe its effect on the options passed to mxcc, and include
an example or note about precedence/format so users can discover and use this
feature.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e95d2cba-a6d9-4555-bb17-7bf0713dd149
📒 Files selected for processing (26)
src/backend/maca/codegen/codegen_maca.ccsrc/layout/gemm_layouts.ccsrc/layout/layout.ccsrc/layout/layout.hsrc/op/builtin.ccsrc/op/builtin.hsrc/target/codegen_maca.ccsrc/tl_templates/maca/barrier.hsrc/tl_templates/maca/common.hsrc/tl_templates/maca/copy.hsrc/tl_templates/maca/gemm.hsrc/tl_templates/maca/gemm_wsm.hsrc/transform/lower_tile_op.cctesting/maca/language/test_tilelang_language_access_ptr_codegen.pytesting/maca/tilelibrary/test_tilelang_maca_gemm_template_contract.pytilelang/contrib/mxcc.pytilelang/language/gemm_op.pytilelang/language/tir/ir.pytilelang/language/tir/ir.pyitilelang/language/tir/op.pytilelang/layout/__init__.pytilelang/layout/swizzle.pytilelang/maca/intrinsics/layout/mma_layout.pytilelang/maca/intrinsics/macro/mma_macro_generator.pytilelang/maca/op/gemm/gemm_mma.pytilelang/tileop/gemm/gemm_base.py
✅ Files skipped from review due to trivial changes (1)
- src/tl_templates/maca/common.h
530db0f to
6f05550
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tilelang/maca/op/gemm/gemm_mma.py (2)
286-298: 💤 Low valueDocument the WSM buffer size constant.
The
0x8000(32 KB) allocation is a magic number. A named constant or brief comment would clarify why this specific size is required for the WSM workspace.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tilelang/maca/op/gemm/gemm_mma.py` around lines 286 - 298, The WSM workspace allocation uses a magic constant (0x8000) in _gemm_ss_wsm_template which should be clarified; replace the literal with a named constant (e.g., WSM_SIZE_BYTES or WSM_WORKSPACE_SIZE = 0x8000) or add an inline comment explaining “32KB required for WSM workspace per TL gemm backend” so readers understand the size choice and update the T.alloc_shared call to use that constant (refer to WSM and _gemm_ss_wsm_template to locate the change).
137-150: 💤 Low valueEnvironment variable is read in both
infer_layoutandlower— ensure consistency.
_get_maca_gemm_k_pack()is called at line 137 (ininfer_layout) and again at line 210 (inlower). If the environment variable changes between these calls—unlikely in practice, but possible—the layout and lowering could use differentk_packvalues, causing mismatched buffer shapes or loop bounds. Consider cachingk_packonce at construction or passing it explicitly frominfer_layouttolower.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tilelang/maca/op/gemm/gemm_mma.py` around lines 137 - 150, infer_layout calls _get_maca_gemm_k_pack(self.k_pack) and lower calls it again, risking inconsistency if the env var changes; to fix, read and cache k_pack once (e.g., store in self._cached_k_pack at construction or the first call) and replace direct calls to _get_maca_gemm_k_pack(...) in both infer_layout and lower with the cached value, or pass the resolved k_pack from infer_layout into lower via the operator state so both use the identical k_pack; update references in methods named infer_layout and lower and any helper uses like _make_maca_gemm_emitter to consume the cached/passed k_pack.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/tl_templates/maca/gemm_wsm.h`:
- Around line 11-143: The template instantiation silently permits unsupported
configurations; add compile-time guards inside gemm_ss_wsm to fail fast:
static_assert that trans_A==false and trans_B==false, num_warp_m==1 and
num_warp_n==1 (the hardwired warp layout), kPack==8 (used as K/8),
AStrideElements%8==0 (used to compute lda_vec), and that the implementation
expects a 4-stage schedule (Stage==4) / 4x4 accumulators (i.e., require any
template flags you depend on to match the hardcoded schedule); reference the
function name gemm_ss_wsm and the symbols ALdgOffset, BLdgOffset, WSM_Ldg,
lda_vec, K/8, and C_f32 when adding these static_asserts so the checks are
colocated with the hardcoded logic.
---
Nitpick comments:
In `@tilelang/maca/op/gemm/gemm_mma.py`:
- Around line 286-298: The WSM workspace allocation uses a magic constant
(0x8000) in _gemm_ss_wsm_template which should be clarified; replace the literal
with a named constant (e.g., WSM_SIZE_BYTES or WSM_WORKSPACE_SIZE = 0x8000) or
add an inline comment explaining “32KB required for WSM workspace per TL gemm
backend” so readers understand the size choice and update the T.alloc_shared
call to use that constant (refer to WSM and _gemm_ss_wsm_template to locate the
change).
- Around line 137-150: infer_layout calls _get_maca_gemm_k_pack(self.k_pack) and
lower calls it again, risking inconsistency if the env var changes; to fix, read
and cache k_pack once (e.g., store in self._cached_k_pack at construction or the
first call) and replace direct calls to _get_maca_gemm_k_pack(...) in both
infer_layout and lower with the cached value, or pass the resolved k_pack from
infer_layout into lower via the operator state so both use the identical k_pack;
update references in methods named infer_layout and lower and any helper uses
like _make_maca_gemm_emitter to consume the cached/passed k_pack.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 81ab369b-85af-4a84-84d9-bb2cbf160cc2
📒 Files selected for processing (29)
src/backend/common/op/reduce.hsrc/backend/cuda/op/reduce.ccsrc/backend/maca/codegen/codegen_maca.ccsrc/backend/maca/op/reduce.ccsrc/backend/rocm/op/reduce.ccsrc/layout/gemm_layouts.ccsrc/layout/layout.ccsrc/layout/layout.hsrc/op/builtin.ccsrc/op/builtin.hsrc/target/codegen_maca.ccsrc/tl_templates/cuda/reduce.hsrc/tl_templates/hip/reduce.hsrc/tl_templates/maca/barrier.hsrc/tl_templates/maca/copy.hsrc/tl_templates/maca/gemm.hsrc/tl_templates/maca/gemm_wsm.hsrc/tl_templates/maca/reduce.hsrc/transform/lower_tile_op.cctesting/maca/language/test_tilelang_language_access_ptr_codegen.pytesting/maca/tilelibrary/test_tilelang_maca_gemm_template_contract.pytilelang/contrib/mxcc.pytilelang/language/gemm_op.pytilelang/layout/__init__.pytilelang/layout/swizzle.pytilelang/maca/intrinsics/layout/mma_layout.pytilelang/maca/intrinsics/macro/mma_macro_generator.pytilelang/maca/op/gemm/gemm_mma.pytilelang/tileop/gemm/gemm_base.py
✅ Files skipped from review due to trivial changes (2)
- src/layout/layout.h
- tilelang/maca/intrinsics/macro/mma_macro_generator.py
6f05550 to
eba3764
Compare
|
Update: I traced the remaining MACA-3.6 failure to a stale test expectation in |
|
@regression-perf |
Performance Regression Test ReportTriggered by: @Five-HZ Results
Artifacts
|
|
@regression-perf |
7b0c0a7 to
da1471d
Compare
Expose MACA GEMM A/B and C fragment layouts to Python and wire the dense TileLang GEMM template path through MACA-specific layouts. Teach MACA codegen to emit tl_gemm calls and derive tl.ptx_cp_async byte widths from access-pointer element types, then add the template header pieces needed by the compiler path. Validation: git diff --cached --check; ./.venv/bin/python -m py_compile tilelang/layout/__init__.py tilelang/layout/swizzle.py tilelang/tileop/gemm/gemm_maca_mma.py testing/maca/tilelibrary/test_tilelang_maca_gemm_template_contract.py testing/maca/language/test_tilelang_language_access_ptr_codegen.py
da1471d to
be876e5
Compare
|
@regression-perf |
Performance Regression Test ReportTriggered by: @ventijing Results
Artifacts
|
|
Updated the branch with a targeted fix for the attention-sink backward regression. Root cause: standalone Validation:
|
Hi maintainers,
This PR restores the MACA intrinsic package that the compiler-generated hgemm path depends on and keeps the TileLang side aligned with the paired TileOps C500 compiler route. The handwritten MACA C kernel remains a reference only; the optimized path stays on the TileLang compiler and lowering route.
Summary
tilelang.intrinsics.gemm_ss, while the supported path uses an explicitly sized 4-stage workspace.Rebase update
dev.tirxand reduce batch-source changes while preserving the MACA GEMM template route.Validation
5 passed.172.076209to205.316929 TFLOPS.89.68%.Notes
tilelang.intrinsics.